feat: add partition artifacts for external vector backends #6463
Open
# Conflicts: # rust/lance/src/index/vector/builder.rs
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action.
This PR adds a partition-addressable intermediate layout (a "partition artifact") that external IVF_PQ backends can produce and Lance can consume directly, replacing a lossy hand-off via a generic dataset.
Background
External backends such as pylance-cuvs already do the expensive work of an IVF_PQ build: they assign each row to a partition and encode PQ codes for the full dataset. Today the only way to hand that result back to Lance is to materialize it as a generic dataset; the finalizer then re-scans and re-groups the rows by partition, throwing away the partitioning the backend just computed.
Layout
An artifact is a small manifest plus a fixed number of bucketed Lance files:
Each row is row_id + part_id + pq_code, routed by bucket = part_id % num_buckets. The manifest records, per logical partition, the file it lives in and the (offset, num_rows) ranges inside that file.
At finalize time, Lance reads only the recorded ranges for the partition it is building.
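The routing rule and the range-based read can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the manifest shape, the file name bucket-1.lance, and the helpers bucket_for and read_partition are hypothetical and are not the format or API added by this PR. Bucket files are modelled as in-memory lists of (row_id, part_id, pq_code) tuples.

```python
# Illustrative only: the manifest shape and file names are assumptions,
# not the on-disk format introduced by this PR.
NUM_BUCKETS = 4  # assumed value


def bucket_for(part_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Route a logical partition to a bucket file: part_id % num_buckets."""
    return part_id % num_buckets


# Hypothetical manifest: per logical partition, the bucket file it lives in
# and the (offset, num_rows) ranges inside that file.
manifest = {
    "num_buckets": NUM_BUCKETS,
    "partitions": {
        1: {"file": "bucket-1.lance", "ranges": [(0, 2), (5, 1)]},
        5: {"file": "bucket-1.lance", "ranges": [(2, 3)]},
    },
}

# Stand-in for one bucket file. Partitions 1 and 5 share it because
# 1 % 4 == 5 % 4 == 1.
bucket_files = {
    "bucket-1.lance": [
        (10, 1, b"\x01"),
        (11, 1, b"\x02"),
        (20, 5, b"\x03"),
        (21, 5, b"\x04"),
        (22, 5, b"\x05"),
        (12, 1, b"\x06"),
    ],
}


def read_partition(part_id: int) -> list:
    """Read only the recorded ranges for one partition, skipping other rows."""
    entry = manifest["partitions"][part_id]
    rows = bucket_files[entry["file"]]
    out = []
    for offset, num_rows in entry["ranges"]:
        out.extend(rows[offset:offset + num_rows])
    return out
```

In this sketch, read_partition(1) touches two ranges of the shared bucket file and never looks at the rows that belong to partition 5.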
Write path
The writer is streaming and bounded in memory. For each input batch it routes rows into per-bucket in-memory buffers; when a buffer fills, it is sorted by part_id, appended to the bucket file, and the covered ranges are recorded in the manifest. There is no second read/sort/rewrite pass at finish().
Changes
Rust:
- PartitionArtifactBuilder and PartitionArtifactShuffleReader
- precomputed_partition_artifact_uri plumbed into the existing vector finalization flow
Python:
- accelerator="cuvs" becomes a thin runtime delegation to the external pylance-cuvs backend. No CUDA/cuVS code lives in this tree.
Non-goals
The on-disk IVF_PQ format, finalizer semantics, and the CPU build path are unchanged. This PR only adds a new input boundary for external backends.
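For reference, the streaming behaviour described under "Write path" can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Rust PartitionArtifactBuilder: the class name WriterSketch, its methods, and the buffer_rows parameter are hypothetical, bucket files are plain lists, and the manifest stores (bucket, offset, num_rows) tuples.

```python
from collections import defaultdict


class WriterSketch:
    """Toy model of a streaming, memory-bounded partition artifact writer."""

    def __init__(self, num_buckets: int, buffer_rows: int):
        self.num_buckets = num_buckets
        self.buffer_rows = buffer_rows      # assumed per-bucket buffer capacity
        self.buffers = defaultdict(list)    # bucket -> rows not yet flushed
        self.files = defaultdict(list)      # bucket -> rows already "on disk"
        self.manifest = defaultdict(list)   # part_id -> [(bucket, offset, num_rows)]

    def write_batch(self, batch):
        """Route each (row_id, part_id, pq_code) row to its bucket buffer."""
        for row in batch:
            bucket = row[1] % self.num_buckets
            self.buffers[bucket].append(row)
            if len(self.buffers[bucket]) >= self.buffer_rows:
                self._flush(bucket)

    def _flush(self, bucket):
        """Sort one buffer by part_id, append it, and record covered ranges."""
        rows = sorted(self.buffers.pop(bucket, []), key=lambda r: r[1])
        if not rows:
            return
        offset = len(self.files[bucket])
        self.files[bucket].extend(rows)
        start = 0
        for i in range(1, len(rows) + 1):
            # Close a run when the part_id changes or the buffer ends.
            if i == len(rows) or rows[i][1] != rows[start][1]:
                part_id = rows[start][1]
                self.manifest[part_id].append((bucket, offset + start, i - start))
                start = i

    def finish(self):
        """Flush what is left; rows already written are never re-read."""
        for bucket in list(self.buffers):
            self._flush(bucket)
        return dict(self.manifest)
```

Because each flush only appends, a partition whose rows arrive across several flushes ends up with several ranges in the same bucket file, which is why the manifest keeps a list of ranges per partition rather than a single one.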